What is Generative AI?

Before we build with it — let’s define it.

Generative AI in One Sentence

AI that generates new content — text, images, code, audio — by learning patterns from existing data.

It doesn’t retrieve. It doesn’t look things up. It generates.

The Key Distinction

A search engine finds content that already exists. A generative model creates content that may have never existed before.

What Can It Generate?

Text

  • Answers, summaries, essays
  • Code in any language
  • Translations, rewrites

Images

  • Photos, illustrations, diagrams
  • UI mockups, product designs

Audio & Video

  • Voice synthesis, music
  • Video generation

Structured data

  • JSON, SQL, tables
  • Function calls, tool outputs

This Course

We focus on text generation — the foundation of every AI agent, chatbot, and assistant you’ll build.

Agenda

  • A. From Words to Intelligence — the evolution that made GenAI possible
  • B. The ChatGPT Revolution — SFT turned completers into assistants
  • C. In-Context Learning — no retraining, just better prompts
  • D. Thinking Models — from CoT to o1 and beyond
  • E. Under the Hood — tokens, probability, sampling controls, and what goes wrong
  • F. Key Intuitions — what you know, what’s next

A. From Words to Intelligence

A conceptual tour of the most important ideas.

It Started with Word Completion

The original goal was simple: predict the next word.

Statistical Language Models (N-Grams)

*“The capital of Saudi Arabia is ___”*

  • Riyadh (99.8%) ✅
  • Jeddah (0.1%)
  • Paris (0.00001%)

The model looked at the words before it and counted: what word most often follows “capital of Saudi Arabia is” in our corpus?
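The counting step above can be sketched in a few lines of Python. This is a toy trigram model over a made-up three-sentence corpus (the sentences and counts are purely illustrative):

```python
from collections import Counter, defaultdict

# Toy corpus: a handful of invented sentences standing in for real data.
corpus = [
    "the capital of saudi arabia is riyadh",
    "the capital of saudi arabia is riyadh",
    "the capital of france is paris",
]

# Count which word follows each 2-word context (a trigram model).
follows = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i in range(len(words) - 2):
        context = (words[i], words[i + 1])
        follows[context][words[i + 2]] += 1

def predict(w1, w2):
    """Return the word most frequently seen after the context (w1, w2)."""
    return follows[(w1, w2)].most_common(1)[0][0]

print(predict("arabia", "is"))  # riyadh
```

No understanding anywhere: the model literally just tallies what it has seen and replays the most frequent continuation.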

The question became: can we teach a model to understand words — not just count them?

Evolution: Step 1 — Word Completion

graph LR
    A["🔤 Statistical LMs<br/>(N-Grams)<br/>Predict by frequency"]

    style A fill:#9B8EC0,stroke:#1C355E,color:#fff

The machine learned to finish sentences — nothing more.

Words Got Meaning — Neural LMs

The next leap: instead of counting words, models learned to understand them.

Words became vectors — points in a mathematical space where meaning lives.

The Key Shift

Statistical LMs asked: “What word usually comes next?” (frequency)

Neural LMs asked: “What word means what comes next?” (understanding)

Now “king” and “queen” are close in that space. Arabic and English words for “house” are neighbors. Translation became possible.
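That geometric closeness can be measured with cosine similarity. The 3-D vectors below are hand-picked for illustration only; real embeddings are learned from data and have hundreds of dimensions:

```python
import numpy as np

# Hypothetical 3-D embeddings, hand-picked just to show the geometry.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.7, 0.2]),
    "house": np.array([0.1, 0.2, 0.9]),
    "بيت":   np.array([0.1, 0.25, 0.85]),  # Arabic for "house"
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0 means unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors["king"], vectors["queen"]))   # high
print(cosine(vectors["king"], vectors["house"]))   # low
print(cosine(vectors["house"], vectors["بيت"]))    # high
```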

More on this in the RAG module.

Evolution: Step 2 — Meaning Enters

graph LR
    A["🔤 Statistical LMs<br/>(N-Grams)<br/>Predict by frequency"] --> B["🧠 Neural LMs<br/>Words as vectors<br/>Predict by meaning"]

    style A fill:#9B8EC0,stroke:#1C355E,color:#fff
    style B fill:#00C9A7,stroke:#1C355E,color:#fff

Same task. Richer representation. The machine finally understood relationships between words.

The Architecture Shift: Transformers (2017)

Before 2017, models such as LSTMs read text sequentially — word by word, left to right.

The Transformer (“Attention is All You Need”, Google 2017) changed everything:

  • Parallel processing: Read the entire document at once
  • Self-Attention: Every word looks at every other word to find relevance
  • Eliminated the bottleneck: Now we could finally train on internet-scale data
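The self-attention bullet can be made concrete with a minimal NumPy sketch: single head, no batching, random untrained weights, just the scaled dot-product mechanism itself:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one sequence (no heads, no batch)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # every token scores every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ V                               # each output mixes all value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                          # 5 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8)
```

Note there is no loop over positions: the whole sequence is processed in one matrix multiply, which is exactly what made parallel, internet-scale training possible.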

Why It Mattered

This is the “engine” inside every modern model — GPT, Claude, Gemini, Llama. Without it, scale was impossible.

The Scale Breakthrough — GPT & LLMs (2018–2020)

Once we had the Transformer, we just needed to scale it up.

  • GPT-1 (2018) proved it worked.
  • GPT-2 (2019) proved it could write believable essays.
  • GPT-3 (2020) proved that scale is a quality all its own.
| Model | Year | Parameters |
|-------|------|------------|
| GPT-1 | 2018 | 117M |
| GPT-2 | 2019 | 1.5B |
| GPT-3 | 2020 | 175B |

The Revelation

At 175B parameters, GPT-3 could translate, summarize, write code, and answer questions — without being explicitly trained for any of them.

Evolution: Step 3 — Scale Changed Everything

graph LR
    A["🔤 Statistical LMs<br/>(N-Grams)"] --> B["🧠 Neural LMs<br/>Words as Vectors"]
    B --> T["⚡ Transformers<br/>(2017)<br/>Parallel Attention"]
    T --> C["🚀 LLMs / GPT<br/>(2018–2020)<br/>Scale + Emergence"]

    style A fill:#9B8EC0,stroke:#1C355E,color:#fff
    style B fill:#00C9A7,stroke:#1C355E,color:#fff
    style T fill:#FFCC33,stroke:#1C355E,color:#333
    style C fill:#FF7A5C,stroke:#1C355E,color:#fff

The task never changed. The scale did — and scale changed everything.

Completion as a Superpower

One mechanism. Many applications.

| Prompt prefix | What the model does |
|---------------|---------------------|
| "Once upon a time" | Continues as a story |
| "Translate to Arabic: Hello → " | Completes the translation |
| "# Python function that reverses a string\ndef reverse(" | Writes the code body |

The Key Insight

Translation, summarization, and coding are all completion problems in disguise. The model doesn’t switch modes — it just completes the pattern.

Evolution: The Full Picture So Far

graph LR
    A["🔤 Statistical LMs<br/>(N-Grams)"] --> B["🧠 Neural LMs<br/>Words as Vectors"]
    B --> T["⚡ Transformers<br/>2017"]
    T --> C["🚀 LLMs<br/>2020"]
    C --> D["💬 ChatGPT<br/>(2022)<br/>Instruction-following<br/>(coming up next)"]

    style A fill:#9B8EC0,stroke:#1C355E,color:#fff
    style B fill:#00C9A7,stroke:#1C355E,color:#fff
    style T fill:#FFCC33,stroke:#1C355E,color:#333
    style C fill:#FF7A5C,stroke:#1C355E,color:#fff
    style D fill:#1C355E,stroke:#00C9A7,color:#fff,stroke-dasharray: 5 5

(Dashed = foreshadowed — we’ll complete this story in Section B)

B. The ChatGPT Revolution

How SFT turned completers into assistants

The Problem with Raw Completers

Base GPT-2: “Write me a poem”

Write me a poem about the weather
in a different language. Write me a
poem about the weather in a...

It continues the sentence. It doesn’t understand you issued a command.

ChatGPT: “Write me a poem”

Here's a poem for you:

The morning light breaks soft and low,
Across the hills where rivers flow...

It treats your text as an instruction and responds accordingly.

What changed?

Not the underlying completion mechanism — the training data and process.

Supervised Fine-Tuning (SFT)

Human annotators wrote thousands of correct instruction-following responses. The model was then trained to mimic them.

[Prompt]   → "Summarize this article in 3 bullet points."
[Response] → "• Point 1\n• Point 2\n• Point 3"

[Prompt]   → "Translate 'Good morning' to French."
[Response] → "Bonjour"

The model learned a new pattern: prompts are commands, not text to continue.
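Turning those pairs into training data is mostly string formatting. The special tokens below (`<|user|>`, `<|assistant|>`, `<|end|>`) are hypothetical; every real model family defines its own chat template:

```python
# Hypothetical chat template: real models each define their own special
# tokens. This sketch only shows the idea of turning (prompt, response)
# pairs into single training strings the model learns to complete.
pairs = [
    ("Summarize this article in 3 bullet points.", "• Point 1\n• Point 2\n• Point 3"),
    ("Translate 'Good morning' to French.", "Bonjour"),
]

def to_training_text(prompt, response):
    return f"<|user|>{prompt}<|assistant|>{response}<|end|>"

dataset = [to_training_text(p, r) for p, r in pairs]
print(dataset[1])
```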

Still completion under the hood

SFT didn’t change the architecture. It changed what the model learned to complete — now it completes helpful assistant responses.

RLHF in 30 Seconds

After SFT, a second step aligned the model with human preferences:

  1. Generate multiple responses to the same prompt
  2. Human rankers pick which response is better
  3. Train a reward model on those preferences
  4. Use the reward model to further tune the base model
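Step 3 is often implemented with a Bradley–Terry style preference loss: push the reward of the human-preferred response above the rejected one. A sketch with made-up scores (not any lab's exact recipe):

```python
import math

def preference_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected): small when the reward model
    already ranks the human-preferred response higher, large otherwise."""
    return -math.log(1 / (1 + math.exp(-(r_chosen - r_rejected))))

print(preference_loss(2.0, 0.5))   # small loss: model agrees with the ranker
print(preference_loss(0.5, 2.0))   # large loss: model disagrees, strong update
```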

One-sentence intuition

RLHF is how we teach the model to prefer helpful and honest completions over merely plausible ones.

The Model Spectrum

graph LR
    A["Base Model<br/>(raw completion)"] --> B["Instruction-Tuned<br/>(follows commands)"]
    B --> C["Chat-Optimized<br/>(multi-turn dialogue)"]

    style A fill:#9B8EC0,stroke:#1C355E,color:#fff
    style B fill:#00C9A7,stroke:#1C355E,color:#fff
    style C fill:#FF7A5C,stroke:#1C355E,color:#fff

Each step is an evolutionary refinement — same foundation, more alignment.

Evolution: The Full Picture (Updated)

graph LR
    A["🔤 Statistical LMs<br/>(N-Grams)"] --> B["🧠 Neural LMs<br/>Words as Vectors"]
    B --> T["⚡ Transformers<br/>2017"]
    T --> C["🚀 LLMs<br/>2020"]
    C --> D["💬 ChatGPT<br/>(2022)<br/>SFT + RLHF<br/>Instruction-following"]

    style A fill:#9B8EC0,stroke:#1C355E,color:#fff
    style B fill:#00C9A7,stroke:#1C355E,color:#fff
    style T fill:#FFCC33,stroke:#1C355E,color:#333
    style C fill:#FF7A5C,stroke:#1C355E,color:#fff
    style D fill:#1C355E,stroke:#00C9A7,color:#fff

The dashed node is now complete. We have a full completer → assistant pipeline.

C. In-Context Learning

No retraining required — just better prompts

The Insight That Changed Everything

You don’t need to retrain the model. Just show it examples in the prompt.

GPT-3 (2020) demonstrated that a large enough model could adapt to a new task from a handful of examples — with no gradient updates, no fine-tuning, no cost.

This was called In-Context Learning (ICL), and it unlocked prompt engineering as a discipline.

Zero-Shot vs Few-Shot

Zero-Shot

Classify the sentiment:
"The service was terrible."

Sentiment:

Model output: Negative

Works for simple, well-defined tasks.

Few-Shot

"Great product!" → Positive
"Terrible service." → Negative
"It was okay." → Neutral

"Absolutely loved it!" →

Model output: Positive

Works for nuanced or custom formats.

Rule of thumb

When zero-shot gives inconsistent results, add 2–3 examples. That’s usually enough.
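Few-shot prompts are just strings, so they can be assembled programmatically. A minimal sketch using the sentiment examples above:

```python
# Build a few-shot prompt: the examples live in the prompt itself,
# not in the model's weights. No retraining involved.
examples = [
    ("Great product!", "Positive"),
    ("Terrible service.", "Negative"),
    ("It was okay.", "Neutral"),
]

def few_shot_prompt(examples, query):
    lines = [f'"{text}" → {label}' for text, label in examples]
    lines.append(f'"{query}" →')           # leave the last label for the model
    return "\n".join(lines)

print(few_shot_prompt(examples, "Absolutely loved it!"))
```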

System Prompts as Configuration

The system prompt is the personality dial — it shapes every response without the user seeing it.

messages = [
    {
        "role": "system",
        "content": (
            "You are a concise financial analyst. "
            "Always cite numbers. Never speculate. "
            "Respond in bullet points."
        )
    },
    {"role": "user", "content": "What drove Apple's revenue in Q3?"}
]

Change the system prompt → change the model’s persona, tone, and constraints. No retraining needed.

Prompt Engineering as a Skill

Four levers you control

  1. Role assignment — “You are an expert in…”
  2. Structured format — JSON, markdown, bullet points
  3. Output constraints — “Respond in under 100 words”
  4. Examples — 2–3 in-context demonstrations

Prompt engineering is not a workaround — it’s the primary interface between your application and the model.
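One prompt can pull all four levers at once. The wording below is illustrative, not a recipe; tune it for your own task:

```python
def build_prompt(question):
    return (
        "You are an expert financial analyst.\n"                       # 1. role
        "Answer as a JSON object with keys 'answer' and 'sources'.\n"  # 2. format
        "Keep 'answer' under 100 words.\n"                             # 3. constraints
        "Example:\n"                                                   # 4. example
        "Q: What is EBITDA?\n"
        '{"answer": "Earnings before interest, taxes, depreciation, '
        'and amortization.", "sources": []}\n'
        f"Q: {question}\n"
    )

print(build_prompt("What drove Apple's revenue in Q3?"))
```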

D. Thinking Models

From chain of thought to o1 and beyond

“Let’s Think Step by Step”

In 2022, researchers discovered a single phrase nearly doubled accuracy on math benchmarks:

Direct answer

Q: Roger has 5 tennis balls.
He buys 2 more cans of 3.
How many does he have?

A: 8

Wrong. (The model guessed.)

With Chain of Thought

Q: Roger has 5 tennis balls.
He buys 2 more cans of 3.
How many does he have?
Let's think step by step.

A: He starts with 5.
2 cans × 3 balls = 6 more.
5 + 6 = 11.

Correct — and verifiable.

Why It Works

Forcing the model to generate intermediate steps means:

  • More tokens generated before the final answer
  • More “computation” applied to the problem
  • Each step constrains what the next step can plausibly be

Intuition

The model doesn’t think before it writes — it thinks by writing. More tokens = more thinking.

The Thinking Model Era

graph LR
    A["Chain of Thought<br/>(2022, prompt trick)"] --> B["o1<br/>(2024, trained CoT)"]
    B --> C["DeepSeek-R1<br/>(2025, open weights)"]
    C --> D["Claude Extended<br/>Thinking (2025)"]

    style A fill:#9B8EC0,stroke:#1C355E,color:#fff
    style B fill:#00C9A7,stroke:#1C355E,color:#fff
    style C fill:#FF7A5C,stroke:#1C355E,color:#fff
    style D fill:#1C355E,stroke:#1C355E,color:#fff

The shift: from prompting a model to think step-by-step, to training a model that reasons internally before responding.

The Trade-off

| | Standard Model | Thinking Model |
|---|----------------|----------------|
| Response time | ~1 second | ~15–30 seconds |
| Cost | Low | 3–10× higher |
| Best for | Chat, summarization, extraction | Math, coding, complex reasoning |
| Failure mode | Hallucination | Overthinking simple tasks |

When to use thinking models

Use them when correctness matters more than speed — audit tasks, code generation, structured data extraction. Don’t use them for real-time chat.

Evolution: The Complete Story

graph LR
    A["🔤 Statistical LMs<br/>(N-Grams)"] --> B["🧠 Neural LMs<br/>Words as Vectors"]
    B --> T["⚡ Transformers<br/>(2017)"]
    T --> C["🚀 LLMs / GPT<br/>(2020)"]
    C --> D["💬 ChatGPT<br/>(2022)"]
    D --> E["🤔 Thinking Models<br/>(2024–2025)"]

    style A fill:#9B8EC0,stroke:#1C355E,color:#fff
    style B fill:#00C9A7,stroke:#1C355E,color:#fff
    style T fill:#FFCC33,stroke:#1C355E,color:#333
    style C fill:#FF7A5C,stroke:#1C355E,color:#fff
    style D fill:#1C355E,stroke:#00C9A7,color:#fff
    style E fill:#9B4DCA,stroke:#1C355E,color:#fff

From counting words to reasoning step-by-step — one continuous thread.

E. Under the Hood

The mechanics behind what you’ve been using all session

The Atomic Unit: Tokens

The machine doesn’t actually see “words.” It sees Tokens.

  • What are they? Fragments of words — sometimes parts of a word, sometimes a whole word.
  • Why? It’s more efficient than having a dictionary of every possible word.
  • The Result: The model can read and write words it has never seen before by combining fragments.
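That fragment-combining idea can be sketched with a toy greedy longest-match tokenizer. The fragment vocabulary here is hand-picked for illustration; real tokenizers learn theirs from data with algorithms like BPE:

```python
# Toy stand-in for a real tokenizer: greedily split a word into the
# longest known fragments. Vocabulary is hand-picked for illustration.
vocab = {"un", "believ", "able", "token", "iz", "ation"}

def to_tokens(word):
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):      # try the longest fragment first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"no fragment matches at position {i}")
    return tokens

print(to_tokens("unbelievable"))   # ['un', 'believ', 'able']
print(to_tokens("tokenization"))   # ['token', 'iz', 'ation']
```

Neither word needs to be in the vocabulary: both are read and written as combinations of known fragments.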

Rule of Thumb

1,000 tokens ≈ 750 words. When you’re billed for “input and output,” you’re paying for these atomic fragments.

The Probability Distribution

The model doesn’t “know” facts; it calculates probabilities.

“According to Article 214 of the…”

  • The (45%)
  • Foreign (30%)
  • Capital (15%)
  • Unicorn (0.01%)

The Decision

It doesn’t always pick the top token. We use Sampling Parameters to control how it chooses from this list.

Temperature: The “Creativity” Dial

Temperature adjusts the shape of that probability curve.

Low Temp (< 0.5) - Sharpened probabilities

High Temp (> 0.8) - Flattened probabilities
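Mechanically, temperature divides the model's raw scores (logits) before the softmax. A sketch with made-up logits for the four candidate tokens above:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Lower temperature sharpens the distribution; higher flattens it."""
    z = np.array(logits) / temperature
    z -= z.max()                      # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [4.0, 3.5, 2.0, -5.0]        # made-up scores: The / Foreign / Capital / Unicorn
print(softmax_with_temperature(logits, 0.3))   # top token dominates
print(softmax_with_temperature(logits, 1.5))   # probabilities flatten out
```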

Top-K: The “Shortlist”

Top-K simply chops off the list after \(K\) items.

“According to Article 214…”

  • The (45%)
  • Foreign (30%)
  • Capital (15%) — cut off (if K=2)
  • Unicorn (0.01%) — cut off

Impact

Here, setting \(K=2\) forces it to stick to the very most likely words, preventing weird deviations like “Unicorn”.
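A minimal sketch of that cutoff, using the probabilities from the example:

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep only the k most likely tokens, then renormalize to sum to 1."""
    probs = np.array(probs, dtype=float)
    cutoff = np.sort(probs)[-k]              # k-th largest probability
    probs[probs < cutoff] = 0.0
    return probs / probs.sum()

probs = [0.45, 0.30, 0.15, 0.0001]           # The / Foreign / Capital / Unicorn
print(top_k_filter(probs, 2))                # only The and Foreign survive
```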

Top-P (Nucleus): The “Smart” Cutoff

Top-P sums up probabilities until it hits \(P\) (e.g., 0.9).

Scenario A: The Law (Ambiguous)

  • The (45%) + Foreign (30%) + Capital (15%) = 90%
  • Result: All three are kept.

Scenario B: The Capital (Certain)

“The capital of Saudi Arabia is…”

  • Riyadh (99.8%)
  • Result: Hit 90% immediately. Stops there.

Why Top-P wins

It adapts to the context. It’s strict when the answer is clear (Riyadh), and flexible when it’s open-ended or we have synonyms.
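Both scenarios can be checked with a short sketch of the nucleus cutoff (probabilities taken from the examples above):

```python
import numpy as np

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    probs = np.array(probs, dtype=float)
    order = np.argsort(probs)[::-1]                    # most likely first
    cumulative = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cumulative, p)) + 1]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# Scenario A: ambiguous, three tokens needed to reach 0.9
print(top_p_filter([0.45, 0.30, 0.15, 0.10], 0.9))
# Scenario B: certain, the first token already exceeds 0.9
print(top_p_filter([0.998, 0.001, 0.0005, 0.0005], 0.9))
```

Same parameter, different effective vocabulary size in each context: that adaptivity is the whole point.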

Cheat Sheet: Configuration

| Parameter | Effect | Recommended Value |
|-----------|--------|-------------------|
| Temperature | Randomness | 0.0 (fact/code) – 0.7 (chat) – 1.0 (creative) |
| Top-P | Dynamic vocabulary cutoff | 0.9 (standard), 1.0 (disabled) |
| Top-K | Hard vocabulary limit | 40–100 (standard) |

Best Practice

Usually, you tune Temperature and Top-P. Leave Top-K alone or at default.

Hallucinations: When Guessing Fails

Hallucinations are structural. They aren’t a bug; they are the result of the model doing its job without enough context.

  1. Probability over Truth — It picks statistically likely words, not verified facts.
  2. Plausibility over Accuracy — It is trained to sound correct, which makes its lies dangerous.
  3. Training Gaps — If it hasn’t seen the “Saudi Investment Law,” it will confidently invent one that sounds legal.

Production Risk

A hallucinating chatbot creates real legal and financial liability. “Confident, fluent lies” are harder to spot than obvious errors.

Mitigation Strategies

| Strategy | How It Works | When to Use |
|----------|--------------|-------------|
| RAG | Ground responses in retrieved documents | Most enterprise apps (Module 03) |
| Output Validation | Verify against structured data/rules | Factual claims, numbers |
| Temp = 0 | Force the most likely (factual) path | Extraction, classification |
| Human-in-the-Loop | Human review before action | Critical decisions |
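Output validation can be as simple as refusing any response that is not valid JSON with the fields you asked for. A minimal sketch; the schema is made up for illustration:

```python
import json

# Hypothetical schema: reject a model response unless it is valid JSON
# with the required fields and types.
REQUIRED = {"revenue_usd": (int, float), "quarter": str}

def validate(response_text):
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError:
        return False
    return all(isinstance(data.get(k), t) for k, t in REQUIRED.items())

print(validate('{"revenue_usd": 81.8, "quarter": "Q3"}'))  # True
print(validate('Apple made a lot of money in Q3.'))        # False
```

A validator like this cannot prove a number is true, but it catches the common failure mode where the model answers fluently in the wrong shape, and it gives you a hook for retries or human review.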

The Core Lesson

The same mechanism that writes a creative novel also writes a confident lie. Control the mechanism, or provide the data.

F. Key Intuitions

What you know and what’s next

Key Takeaways

  1. GenAI generates, not retrieves — it creates content by predicting what comes next, at scale.
  2. Completion is the primitive — translation, coding, and conversation all reduce to next-word prediction.
  3. Scale created emergence — GPT-3 wasn’t programmed to translate; it learned translation as a side effect of predicting text at 175B parameters.
  4. Hallucinations are structural — the model generates plausible text, not true text. Always plan a mitigation strategy.
  5. SFT aligned completers — ChatGPT didn’t invent a new architecture; it trained the model to treat prompts as commands.
  6. In-context learning — you can adapt a model to new tasks in real time, just by showing examples in the prompt.
  7. Thinking models trade speed for accuracy — chain of thought and o1-style reasoning give you better answers at higher cost and latency.

What We Intentionally Skipped

This session focused on intuition. We deliberately left out:

| Skipped | Why | When you'll need it |
|---------|-----|---------------------|
| Transformer architecture | You don't need it to build | Academic curiosity only |
| Attention math | Same reason | Deep ML research |
| KV-cache mechanics | Not relevant yet | Session 2 — it affects your API bill |

The principle

Understand the behavior of the model before its internals. Once you’re building with the API, the internals become relevant — and we’ll cover them exactly when they matter.

Up Next

Lab 1: Hands-on tokenization and cost analysis — see how tokens, context windows, and costs connect in practice.